Machine Learning - Assignment 2¶

Spotify and Youtube Dataset Analysis¶

This notebook covers Data Exploration & Visualization, Pre-processing, Model Building and Training, and Clustering of the Spotify and YouTube dataset from Kaggle.

Let's start by importing the necessary libraries:

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, classification_report, f1_score, ConfusionMatrixDisplay, silhouette_score

Part A: Data Exploration & Visualization¶

Data Loading and Initial Exploration¶

In [8]:
# Load the dataset
df = pd.read_csv("Spotify_Youtube.csv", index_col=0) 
# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
df.head()
First 5 rows of the dataset:
Out[8]:
Artist Url_spotify Track Album Album_type Uri Danceability Energy Key Loudness ... Url_youtube Title Channel Views Likes Comments Description Licensed official_video Stream
0 Gorillaz https://open.spotify.com/artist/3AA28KZvwAUcZu... Feel Good Inc. Demon Days album spotify:track:0d28khcov6AiegSCpG5TuT 0.818 0.705 6.0 -6.679 ... https://www.youtube.com/watch?v=HyHNuVaZJ-k Gorillaz - Feel Good Inc. (Official Video) Gorillaz 693555221.0 6220896.0 169907.0 Official HD Video for Gorillaz' fantastic trac... True True 1.040235e+09
1 Gorillaz https://open.spotify.com/artist/3AA28KZvwAUcZu... Rhinestone Eyes Plastic Beach album spotify:track:1foMv2HQwfQ2vntFf9HFeG 0.676 0.703 8.0 -5.815 ... https://www.youtube.com/watch?v=yYDmaexVHic Gorillaz - Rhinestone Eyes [Storyboard Film] (... Gorillaz 72011645.0 1079128.0 31003.0 The official video for Gorillaz - Rhinestone E... True True 3.100837e+08
2 Gorillaz https://open.spotify.com/artist/3AA28KZvwAUcZu... New Gold (feat. Tame Impala and Bootie Brown) New Gold (feat. Tame Impala and Bootie Brown) single spotify:track:64dLd6rVqDLtkXFYrEUHIU 0.695 0.923 1.0 -3.930 ... https://www.youtube.com/watch?v=qJa-VFwPpYA Gorillaz - New Gold ft. Tame Impala & Bootie B... Gorillaz 8435055.0 282142.0 7399.0 Gorillaz - New Gold ft. Tame Impala & Bootie B... True True 6.306347e+07
3 Gorillaz https://open.spotify.com/artist/3AA28KZvwAUcZu... On Melancholy Hill Plastic Beach album spotify:track:0q6LuUqGLUiCPP1cbdwFs3 0.689 0.739 2.0 -5.810 ... https://www.youtube.com/watch?v=04mfKJWDSzI Gorillaz - On Melancholy Hill (Official Video) Gorillaz 211754952.0 1788577.0 55229.0 Follow Gorillaz online:\nhttp://gorillaz.com \... True True 4.346636e+08
4 Gorillaz https://open.spotify.com/artist/3AA28KZvwAUcZu... Clint Eastwood Gorillaz album spotify:track:7yMiX7n9SBvadzox8T5jzT 0.663 0.694 10.0 -8.627 ... https://www.youtube.com/watch?v=1V_xRb0x9aw Gorillaz - Clint Eastwood (Official Video) Gorillaz 618480958.0 6197318.0 155930.0 The official music video for Gorillaz - Clint ... True True 6.172597e+08

5 rows × 27 columns

In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 20718 entries, 0 to 20717
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Artist            20718 non-null  object 
 1   Url_spotify       20718 non-null  object 
 2   Track             20718 non-null  object 
 3   Album             20718 non-null  object 
 4   Album_type        20718 non-null  object 
 5   Uri               20718 non-null  object 
 6   Danceability      20716 non-null  float64
 7   Energy            20716 non-null  float64
 8   Key               20716 non-null  float64
 9   Loudness          20716 non-null  float64
 10  Speechiness       20716 non-null  float64
 11  Acousticness      20716 non-null  float64
 12  Instrumentalness  20716 non-null  float64
 13  Liveness          20716 non-null  float64
 14  Valence           20716 non-null  float64
 15  Tempo             20716 non-null  float64
 16  Duration_ms       20716 non-null  float64
 17  Url_youtube       20248 non-null  object 
 18  Title             20248 non-null  object 
 19  Channel           20248 non-null  object 
 20  Views             20248 non-null  float64
 21  Likes             20177 non-null  float64
 22  Comments          20149 non-null  float64
 23  Description       19842 non-null  object 
 24  Licensed          20248 non-null  object 
 25  official_video    20248 non-null  object 
 26  Stream            20142 non-null  float64
dtypes: float64(15), object(12)
memory usage: 4.4+ MB
In [10]:
# Generate descriptive statistics
print("Descriptive Statistics:")
df.describe()
Descriptive Statistics:
Out[10]:
Danceability Energy Key Loudness Speechiness Acousticness Instrumentalness Liveness Valence Tempo Duration_ms Views Likes Comments Stream
count 20716.000000 20716.000000 20716.000000 20716.000000 20716.000000 20716.000000 20716.000000 20716.000000 20716.000000 20716.000000 2.071600e+04 2.024800e+04 2.017700e+04 2.014900e+04 2.014200e+04
mean 0.619777 0.635250 5.300348 -7.671680 0.096456 0.291535 0.055962 0.193521 0.529853 120.638340 2.247176e+05 9.393782e+07 6.633411e+05 2.751899e+04 1.359422e+08
std 0.165272 0.214147 3.576449 4.632749 0.111960 0.286299 0.193262 0.168531 0.245441 29.579018 1.247905e+05 2.746443e+08 1.789324e+06 1.932347e+05 2.441321e+08
min 0.000000 0.000020 0.000000 -46.251000 0.000000 0.000001 0.000000 0.014500 0.000000 0.000000 3.098500e+04 0.000000e+00 0.000000e+00 0.000000e+00 6.574000e+03
25% 0.518000 0.507000 2.000000 -8.858000 0.035700 0.045200 0.000000 0.094100 0.339000 97.002000 1.800095e+05 1.826002e+06 2.158100e+04 5.090000e+02 1.767486e+07
50% 0.637000 0.666000 5.000000 -6.536000 0.050500 0.193000 0.000002 0.125000 0.537000 119.965000 2.132845e+05 1.450110e+07 1.244810e+05 3.277000e+03 4.968298e+07
75% 0.740250 0.798000 8.000000 -4.931000 0.103000 0.477250 0.000463 0.237000 0.726250 139.935000 2.524430e+05 7.039975e+07 5.221480e+05 1.436000e+04 1.383581e+08
max 0.975000 1.000000 11.000000 0.920000 0.964000 0.996000 1.000000 1.000000 0.993000 243.372000 4.676058e+06 8.079649e+09 5.078865e+07 1.608314e+07 3.386520e+09
In [11]:
# Check for missing values
print("Missing Values Count:")
df.isnull().sum()
Missing Values Count:
Out[11]:
Artist                0
Url_spotify           0
Track                 0
Album                 0
Album_type            0
Uri                   0
Danceability          2
Energy                2
Key                   2
Loudness              2
Speechiness           2
Acousticness          2
Instrumentalness      2
Liveness              2
Valence               2
Tempo                 2
Duration_ms           2
Url_youtube         470
Title               470
Channel             470
Views               470
Likes               541
Comments            569
Description         876
Licensed            470
official_video      470
Stream              576
dtype: int64

This initial summary shows we have around 20,000 entries with features such as danceability, energy, loudness, views, likes, and streams. Some columns have minor missing values (e.g., Likes, Comments), and the data types are appropriate. Overall, the dataset is rich in both numeric and categorical features. We will handle the missing data in Part B as required.
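When judging whether missing values are "minor", it can help to express them as percentages of the total rows rather than raw counts. A minimal sketch, using a tiny synthetic frame here so it runs standalone (in the notebook the same two lines would be called on df directly):

```python
import pandas as pd
import numpy as np

# Tiny synthetic frame standing in for the real dataset
demo = pd.DataFrame({
    "Likes": [10, np.nan, 30, 40],
    "Comments": [1, 2, np.nan, np.nan],
    "Artist": ["a", "b", "c", "d"],
})

# Share of missing values per column, as a percentage, largest first
missing_pct = demo.isnull().mean().mul(100).round(2)
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```

On the real dataset this makes it immediately visible that even the worst column (Description, 876 missing) is under 5% of the ~20,700 rows.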

In [13]:
df['Album_type'] = df['Album_type'].replace('compilation', 'album')
  • We treat compilations as albums, since our task is to predict whether a song is published as part of an album or as a single.

Understanding the Dataset¶

In [16]:
album_counts = df['Album_type'].value_counts().reset_index()
album_counts.columns = ['Album_type', 'Count']
album_counts['Percentage'] = album_counts['Count'] / album_counts['Count'].sum() * 100

fig = px.bar(album_counts, x='Album_type', y='Percentage', text='Percentage', color_discrete_sequence=['#00CC99'] * len(album_counts))
fig.update_traces(texttemplate='%{y:.2f}%', textposition='outside', hovertemplate='%{x}<br>Total Songs: %{customdata}', customdata=album_counts[['Count']], textfont_size=12)

# ⬇️ Reduce top margin & remove forced y-axis max
fig.update_layout(yaxis_title="Percentage from total (%)", margin=dict(t=50),  # reduced from 120
    title=dict(text='<b>Distribution Count of Album Types</b>', x=0.5, y=0.95, font=dict(family="Helvetica", size=25)),
    title_font_color='black',
    legend=dict(title_font_family="Helvetica", font=dict(size=15), orientation="h", yanchor="bottom", y=0.99, xanchor="right", x=0.65),
    uniformtext_minsize=10, uniformtext_mode='hide')

fig.show()

Observation: The distribution of the target variable Album_type reveals a significant class imbalance: the majority of songs are labeled "album", while a smaller proportion are labeled "single". This imbalance may influence model predictions and should be taken into account when evaluating performance and interpreting results.
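One common way to account for such imbalance is to weight classes inversely to their frequency. A minimal sketch with synthetic labels standing in for df['Album_type'] (scikit-learn's compute_class_weight and the class_weight='balanced' option implement the same idea):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels mimicking the album/single split
y = np.array(["album"] * 80 + ["single"] * 20)

# Inspect the imbalance
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # {'album': 80, 'single': 20}

# Weights inversely proportional to class frequency: n / (n_classes * count)
weights = compute_class_weight("balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))  # {'album': 0.625, 'single': 2.5}

# Most sklearn classifiers accept the same idea directly:
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
```

Reporting the F1-score alongside accuracy (as done in the imports above) is the complementary evaluation-side precaution.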

In [18]:
metrics = ['Views', 'Likes', 'Stream', 'Comments']
colors = ['lightgreen', 'salmon']  # album, single

fig, axes = plt.subplots(1, 4, figsize=(18, 5))

for i, metric in enumerate(metrics):
    values = df.groupby('Album_type')[metric].sum()
    axes[i].pie(values, labels=values.index, colors=colors, startangle=90,
                counterclock=False, wedgeprops={'width': 0.4, 'edgecolor': 'white'},
                autopct='%1.1f%%')
    axes[i].set_title(f'{metric} Distribution')

plt.tight_layout()
plt.show()
In [19]:
# Apply log1p (log(x + 1)) transformation to avoid log(0) issues
df_log = df.copy()
for col in ['Likes', 'Views', 'Comments', 'Stream']:
    df_log[col] = np.log1p(df_log[col])

# Create the pairplot
sns.pairplot(df_log[['Likes', 'Views', 'Comments', 'Stream', 'Album_type']],
             hue='Album_type',
             palette={'album': 'lightgreen', 'single': 'salmon'},
             corner=True)

plt.suptitle('Log-Transformed Pairwise Plot: Single vs Album', y=1.02)
plt.show()
Here’s an overview of how the major engagement features (Likes, Views, Comments, Streams) interact and how they differ between singles and albums.¶

Now let’s dive into specific relationships that looked interesting:

In [22]:
sns.set(style='whitegrid')
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Define custom color palette
custom_palette = {'album': 'lightgreen', 'single': 'salmon'}

# Scatter: Comments vs Likes
sns.scatterplot(data=df, x='Likes', y='Comments', hue='Album_type',
                ax=axes[0], alpha=0.9, s=20, edgecolors='black', legend='full', palette=custom_palette)
axes[0].set_title('Comments vs Likes')
axes[0].set_xscale('log'); axes[0].set_yscale('log')

# Scatter: Streams vs Views
sns.scatterplot(data=df, x='Views', y='Stream', hue='Album_type',
                ax=axes[1], alpha=0.9, s=20, edgecolors='black', legend='full', palette=custom_palette)
axes[1].set_title('Streams vs Views')
axes[1].set_xscale('log'); axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

Observation:

Comments vs Likes: This scatter plot shows a strong positive relationship between the number of likes and comments a song receives. The trend is similar for both singles and albums, with both groups forming a clear upward pattern. Most songs fall into a mid-range cluster, but singles and albums are both present across the entire range, including the top-performing songs.

Although both classes overlap quite a lot, singles seem to be slightly more spread out in the lower-comment, high-like range. This could suggest that singles sometimes attract likes without as much discussion, while albums may generate more balanced engagement. Overall, both features are highly correlated and might be good candidates for modeling.

Streams vs Views: The scatter plot shows a strong positive relationship between views and streams overall. Singles and albums follow a similar trend, but albums dominate at the high-volume end. This indicates that while singles may be more efficient in terms of streams per view, albums tend to generate higher absolute numbers.

In [24]:
sns.heatmap(df[['Likes', 'Views', 'Comments', 'Stream']].corr(), annot=True, cmap='coolwarm')
Out[24]:
<Axes: >

Observation: The correlation heatmap shows that Likes and Views are highly correlated (0.89), meaning they likely capture similar patterns of user engagement. Comments shows lower correlation with all other features, suggesting it brings distinct information. This supports using ratios and engineered features (like Likes/Views or Stream/Views) to reduce redundancy and highlight engagement quality.

In [26]:
# Create the ratio column
df['Stream_to_Views'] = df['Stream'] / df['Views']

df_filtered = df[df['Stream_to_Views'] > 0]  # avoids division-by-zero log issues

plt.figure(figsize=(8, 6))
sns.violinplot(data=df_filtered, x='Album_type', y='Stream_to_Views',hue='Album_type',
               palette={'album': 'lightgreen', 'single': 'salmon'},
               inner='box', cut=0)
plt.yscale('log')
plt.title('Stream-to-Views Ratio by Album Type')
plt.xlabel('Album Type')
plt.ylabel('Stream / Views (log scale)')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

Observation: This plot compares how many streams songs get relative to their views for singles and albums. Using a log scale makes it easier to see the differences. Singles seem to have a slightly higher median stream-to-view ratio and more spread overall, while albums are more tightly packed at lower ratios. This might suggest that singles are streamed more efficiently for each view they get, possibly due to more focused promotion or popularity spikes.
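The visual impression of a higher median for singles can be backed up with a quick numeric check of the median ratio per class. A minimal sketch on a toy frame (in the notebook, df and its Stream_to_Views column would be used instead):

```python
import pandas as pd

# Toy frame standing in for df with the Stream_to_Views ratio
toy = pd.DataFrame({
    "Album_type": ["album", "album", "single", "single"],
    "Stream": [100.0, 300.0, 500.0, 800.0],
    "Views": [50.0, 100.0, 100.0, 100.0],
})
toy["Stream_to_Views"] = toy["Stream"] / toy["Views"]

# Median ratio per class quantifies what the violin plot suggests visually
medians = toy.groupby("Album_type")["Stream_to_Views"].median()
print(medians)
```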

In [28]:
# Calculate Likes-to-Views ratio
df['Likes_to_Views'] = df['Likes'] / df['Views']

df_filtered = df[df['Likes_to_Views'].between(0, 1)]

plt.figure(figsize=(8, 6))
sns.violinplot(data=df_filtered, x='Album_type', y='Likes_to_Views',hue='Album_type', palette={'album': 'lightgreen', 'single': 'salmon'}, inner='box')

plt.title('Likes-to-Views Ratio by Album Type')
plt.xlabel('Album Type')
plt.ylabel('Likes / Views')
plt.yscale('log')

plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

Observation: This violin plot shows the ratio of likes to views for singles and albums, on a log scale to make the differences easier to see. Singles generally have a higher likes-to-views ratio than albums: their distribution is wider and their median is higher, meaning singles tend to get more likes per view on average.

Albums are more concentrated at the lower end of the ratio, while singles are spread more across the mid-range. This could mean that singles get more focused attention or are more likely to go viral compared to songs that are part of an album. Overall, this ratio might be a useful feature to help tell singles and albums apart in our model.

In [30]:
# Create the ratio
df['Comments_to_Likes'] = df['Comments'] / df['Likes']

df_filtered = df[df['Comments_to_Likes'].between(0, 1)]

plt.figure(figsize=(8, 6))
sns.violinplot(data=df_filtered, x='Album_type', y='Comments_to_Likes',
               hue='Album_type', palette={'album': 'lightgreen', 'single': 'salmon'},
               inner='box', cut=0)
plt.yscale('log')
plt.title('Comments-to-Likes Ratio by Album Type')
plt.xlabel('Album Type')
plt.ylabel('Comments / Likes (log scale)')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

Observation: Singles generally have a slightly wider distribution and a higher median in the Comments/Likes ratio compared to albums. This could mean that singles get a bit more expressive engagement (comments) relative to how much they’re liked, while albums might be liked passively more often. Overall, this ratio adds a new dimension beyond just raw popularity — it helps highlight how actively listeners engage with songs.

In [32]:
# Select the audio features and the label
audio_features = ['Danceability', 'Valence', 'Loudness', 'Energy', 'Tempo', 'Album_type']
df_audio = df[audio_features].dropna()

sns.pairplot(df_audio, hue='Album_type',
             palette={'album': 'lightgreen', 'single': 'salmon'},
             plot_kws={'alpha': 0.9, 's': 20})
plt.suptitle('Pairplot of Audio Features by Album Type', y=1.02)
plt.show()

Observation: This pairplot compares audio-related features across singles and albums. While there’s a lot of overlap between the two types, a few patterns stand out:

  • Danceability and Valence show a slightly higher density for singles in the upper range, suggesting that singles tend to be more upbeat and danceable.

  • Loudness distributions reveal that singles are often louder (closer to 0 dB), while albums cover a broader range that includes softer tracks.

  • Energy shows a strong curved relationship with Loudness: most high-energy tracks are also loud, especially singles.

  • Tempo is variable across both classes with no strong separation, though singles cluster slightly more around mid-tempo values (~100–130 BPM).

In [34]:
plt.figure(figsize=(8, 6))
sns.violinplot(data=df, x='Album_type', y='Loudness',
               hue='Album_type', palette={'album': 'lightgreen', 'single': 'salmon'},
               inner='box', cut=0)
plt.title('Loudness Distribution by Album Type')
plt.ylabel('Loudness (dB)')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

Observation: Singles tend to have higher loudness (closer to 0 dB), meaning they're generally mastered louder. Albums include more variety in loudness, including quieter tracks. This supports using Loudness as a feature to distinguish singles from albums.

In [36]:
# Create binary feature
df['Loudness_high'] = df['Loudness'] > df['Loudness'].median()

# Compute proportions
prop = df.groupby('Album_type')['Loudness_high'].mean().reset_index()

plt.figure(figsize=(6, 5))
sns.barplot(data=prop, x='Album_type', y='Loudness_high',hue='Album_type',
            palette={'album': 'lightgreen', 'single': 'salmon'})
plt.title('Proportion of Loud Songs by Album Type')
plt.ylabel('% Loud Songs')
plt.ylim(0, 1)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

Observation: A higher proportion of singles are louder than the median loudness compared to albums. This supports the idea that singles are more aggressively mastered and justifies the creation of a binary Loudness_high feature for the model.

In [38]:
plt.figure(figsize=(8, 6))
sns.kdeplot(data=df, x='Danceability', y='Valence', hue='Album_type',
            fill=True, alpha=0.4, palette={'album': 'lightgreen', 'single': 'salmon'})
plt.title('Density of Danceability vs Valence by Album Type')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

Observation: This KDE plot shows the joint distribution of Danceability and Valence across album types. Singles cluster more in the top-right quadrant, indicating they are generally more upbeat and danceable than album tracks, while albums are spread across the entire space, suggesting more diversity in mood and style. This confirms that Danceability × Valence could be a useful feature for modeling.

In [40]:
plt.figure(figsize=(8, 6))
sns.kdeplot(data=df, x='Loudness', y='Energy',
            hue='Album_type', fill=True, alpha=0.4,
            palette={'album': 'lightgreen', 'single': 'salmon'})
plt.title('Density of Energy vs Loudness by Album Type')
plt.xlabel('Loudness (dB)')
plt.ylabel('Energy')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

Observation: This density plot shows that high-energy songs tend to be louder, and singles are more concentrated in the top-right area of the plot. This means singles are generally both louder and more energetic, likely because they are optimized to stand out in playlists or radio. Albums appear to have more variety, including quieter or lower-energy songs.

In [42]:
popularity_df = df[(df[['Stream', 'Likes', 'Comments', 'Views']] > 0).all(axis=1)].copy()

popularity_df['Popularity'] = (
    popularity_df['Stream'].rank(pct=True) +
    popularity_df['Likes'].rank(pct=True) +
    popularity_df['Comments'].rank(pct=True) +
    popularity_df['Views'].rank(pct=True)
) / 4

# Replot the 2D density plot with no warning
plt.figure(figsize=(10, 6))
sns.kdeplot(
    data=popularity_df,
    x='Valence',
    y='Energy',
    weights=popularity_df['Popularity'],
    fill=True,
    cmap='viridis',
    thresh=0.01,
    levels=100
)
plt.title("Valence vs Energy (Weighted by Popularity)")
plt.xlabel("Valence")
plt.ylabel("Energy")
plt.tight_layout()
plt.show()


In [44]:
# Group by Album and aggregate total Streams and Views
album_stats = df.groupby('Album')[['Stream', 'Views']].sum().sort_values(by='Stream', ascending=False).head(15)

# DataFrame.plot creates its own figure, so no separate plt.figure() is needed
album_stats.plot(kind='bar', figsize=(12, 6), color={'Stream': 'lightblue', 'Views': 'darkorange'})
plt.title("Top 15 Albums by Stream and Views")
plt.ylabel("Count")
plt.xlabel("Album")
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

The Valence vs Energy density plot above shows where the most popular songs are concentrated, based on:

  • Valence (musical positivity)

  • Energy (intensity and activity)

The brighter areas represent combinations that correlate with higher popularity, based on a normalized blend of streams, likes, comments, and views.

In [46]:
licensed_counts = df.groupby(['Album_type', 'Licensed']).size().unstack()
licensed_counts.plot(kind='barh', stacked=True,
                     color=['lightcoral', 'lightblue'], figsize=(8, 5))
plt.title('Licensed Status Distribution by Album Type')
plt.xlabel('Number of Songs')
plt.tight_layout()
plt.show()

Observation: This stacked bar chart shows the distribution of licensed and unlicensed songs for singles and albums. While albums have more songs overall, the proportion of licensed songs appears similar between the two. Both types are mostly licensed, and the difference between them is not strong.

Based on this, Licensed does not seem to provide useful separation between singles and albums and is unlikely to help the model as a predictive feature.
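That judgment can be checked numerically with a row-normalized crosstab, which gives the share of licensed songs within each class. A minimal sketch on toy data (in the notebook, df['Album_type'] and df['Licensed'] would be passed instead):

```python
import pandas as pd

# Toy data standing in for the real Album_type / Licensed columns
toy = pd.DataFrame({
    "Album_type": ["album"] * 6 + ["single"] * 4,
    "Licensed": [True, True, True, True, False, False, True, True, True, False],
})

# Row-normalized crosstab: share of licensed songs within each class;
# similar rows would confirm the feature separates the classes poorly
share = pd.crosstab(toy["Album_type"], toy["Licensed"], normalize="index")
print(share.round(2))
```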

In [48]:
official_counts = df.groupby(['Album_type', 'official_video']).size().unstack()
official_counts.plot(kind='barh', stacked=True,
                     color=['lightcoral', 'lightblue'], figsize=(8, 5))
plt.title('Official Video Distribution by Album Type')
plt.xlabel('Number of Songs')
plt.tight_layout()
plt.show()

Observation: The distribution of official videos across singles and albums is fairly similar. Most songs in both categories are marked as official videos, so this feature doesn't provide strong class separation. For this reason, we decided not to include official_video as a predictive feature in our model.

Part B: Data Pre-processing¶

Data preparation:¶

We split the data cleaning into two parts: before and after feature engineering.

Cleaning Before Feature Engineering:¶

This initial step ensures that all base columns used for creating new features are valid and reliable:

  • Likes and Comments were filled with 0, assuming that missing engagement data likely reflects no interaction, a common and safe assumption for social media/streaming metrics.
  • We dropped rows with missing values in critical columns like Views, Duration_ms, Loudness, Valence, Danceability, Energy, and Stream. These columns are used in calculations such as:
    • Ratios (Likes_to_Views, Comments_to_Likes)
    • Composite scores (Fitness_for_Clubs)
  • Dropping them early helps avoid divide-by-NaN errors or invalid log transformations.
  • Additionally, some dropped columns (e.g., Title, Track, Description) were only needed temporarily for feature engineering (Is_Remix) and are irrelevant for model training.
In [53]:
# Reload dataset to make sure we're working cleanly
df = pd.read_csv("Spotify_Youtube.csv")
df['Album_type'] = df['Album_type'].replace('compilation', 'album')
In [54]:
df['Likes'] = df['Likes'].fillna(0)
df['Comments'] = df['Comments'].fillna(0)
In [55]:
df.dropna(subset=['Views', 'Duration_ms', 'Loudness', 'Valence', 'Danceability', 'Energy','Stream','Title', 'Track', 'Description'], inplace=True)

Feature Engineering¶

In [57]:
df['Album_Song_Count'] = df.groupby('Album')['Track'].transform('count')

artist_view_avg = df.groupby('Artist')['Views'].transform('mean')
df['Avg_Artist_Song_Views'] = artist_view_avg

df['Song_Name_Length'] = df['Track'].astype(str).apply(lambda x: len(x.split()))

df['Total_Album_Length'] = df.groupby('Album')['Duration_ms'].transform('sum')

# Normalize loudness to [0,1] before averaging
loudness_norm = (df['Loudness'] - df['Loudness'].min()) / (df['Loudness'].max() - df['Loudness'].min())

# Compute Fitness_for_Clubs as the average of 4 features
df['Fitness_for_Clubs'] = pd.concat([
    df[['Danceability', 'Energy', 'Valence']],
    loudness_norm.to_frame('Loudness')
], axis=1).mean(axis=1)

# --- 8 Additional Recommended Features ---

df['Likes_to_Views'] = df['Likes'] / df['Views']

df['Stream_to_Views'] = df['Stream'] / df['Views']

df['Comments_to_Likes'] = df['Comments'] / df['Likes']

df['Loudness_High'] = df['Loudness'] > df['Loudness'].median()

df['Danceability_Valence'] = df['Danceability'] * df['Valence']

df['Popular_Site'] = (df['Views'] > df['Stream']).astype(int)


df['Is_Remix'] = df[['Track', 'Title', 'Description']].astype(str).apply(
    lambda row: 'remix' in ' '.join(row).lower(), axis=1)

df['Streams_per_Minute'] = df['Stream'] / (df['Duration_ms'] / 60000)

# Return updated dataframe shape and columns added
df.shape, df.columns[-13:].tolist()
Out[57]:
((19298, 41),
 ['Album_Song_Count',
  'Avg_Artist_Song_Views',
  'Song_Name_Length',
  'Total_Album_Length',
  'Fitness_for_Clubs',
  'Likes_to_Views',
  'Stream_to_Views',
  'Comments_to_Likes',
  'Loudness_High',
  'Danceability_Valence',
  'Popular_Site',
  'Is_Remix',
  'Streams_per_Minute'])
| Feature Name | Formula / Description | Reason to Add |
| --- | --- | --- |
| Album_Song_Count | Number of songs in the current song’s album | Albums typically have more than one track; singles only have one |
| Avg_Artist_Song_Views | Average views of all songs by the current artist | Reflects artist popularity, which may impact release format |
| Song_Name_Length | Number of words in the track name | Singles might have shorter, catchier names |
| Total_Album_Length | Total duration (sum of durations) of all songs in the album | Albums are longer; singles = single track length |
| Fitness_for_Clubs | Average of Danceability, Energy, Valence + normalized Loudness | Measures how suitable a song is for energetic environments |
| Likes_to_Views | Likes ÷ Views | Indicates audience engagement and song appeal |
| Stream_to_Views | Spotify Streams ÷ YouTube Views | Shows which platform is more dominant for a song |
| Comments_to_Likes | Comments ÷ Likes | Captures how expressive or controversial a song is |
| Loudness_High | Boolean: True if Loudness is above the dataset median | Singles are often louder (commercial mastering) |
| Danceability_Valence | Danceability × Valence | Indicates upbeat/feel-good potential |
| Popular_Site | Binary: 1 if YouTube Views exceed Spotify Streams, else 0 | Helps identify platform audience bias |
| Is_Remix | Boolean: True if 'remix' appears in title, track, or description | Remixes may follow different release patterns |
| Streams_per_Minute | Streams ÷ (Duration in minutes) | Highlights songs with replay value or viral potential |

Data Cleaning Part 2:¶

In [60]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 19298 entries, 0 to 20717
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             19298 non-null  int64  
 1   Artist                 19298 non-null  object 
 2   Url_spotify            19298 non-null  object 
 3   Track                  19298 non-null  object 
 4   Album                  19298 non-null  object 
 5   Album_type             19298 non-null  object 
 6   Uri                    19298 non-null  object 
 7   Danceability           19298 non-null  float64
 8   Energy                 19298 non-null  float64
 9   Key                    19298 non-null  float64
 10  Loudness               19298 non-null  float64
 11  Speechiness            19298 non-null  float64
 12  Acousticness           19298 non-null  float64
 13  Instrumentalness       19298 non-null  float64
 14  Liveness               19298 non-null  float64
 15  Valence                19298 non-null  float64
 16  Tempo                  19298 non-null  float64
 17  Duration_ms            19298 non-null  float64
 18  Url_youtube            19298 non-null  object 
 19  Title                  19298 non-null  object 
 20  Channel                19298 non-null  object 
 21  Views                  19298 non-null  float64
 22  Likes                  19298 non-null  float64
 23  Comments               19298 non-null  float64
 24  Description            19298 non-null  object 
 25  Licensed               19298 non-null  object 
 26  official_video         19298 non-null  object 
 27  Stream                 19298 non-null  float64
 28  Album_Song_Count       19298 non-null  int64  
 29  Avg_Artist_Song_Views  19298 non-null  float64
 30  Song_Name_Length       19298 non-null  int64  
 31  Total_Album_Length     19298 non-null  float64
 32  Fitness_for_Clubs      19298 non-null  float64
 33  Likes_to_Views         19298 non-null  float64
 34  Stream_to_Views        19298 non-null  float64
 35  Comments_to_Likes      19272 non-null  float64
 36  Loudness_High          19298 non-null  bool   
 37  Danceability_Valence   19298 non-null  float64
 38  Popular_Site           19298 non-null  int32  
 39  Is_Remix               19298 non-null  bool   
 40  Streams_per_Minute     19298 non-null  float64
dtypes: bool(2), float64(23), int32(1), int64(3), object(12)
memory usage: 5.9+ MB
In [61]:
# Handle division by zero explicitly and safely
df['Likes_to_Views'] = np.where(df['Views'] > 0, df['Likes'] / df['Views'], 0)
df['Stream_to_Views'] = np.where(df['Views'] > 0, df['Stream'] / df['Views'], 0)
df['Comments_to_Likes'] = np.where(df['Likes'] > 0, df['Comments'] / df['Likes'], 0)
In [62]:
log_features = ['Views', 'Likes', 'Comments', 'Stream','Album_Song_Count', 'Avg_Artist_Song_Views',
    'Total_Album_Length', 'Streams_per_Minute','Stream_to_Views', 'Likes_to_Views','Comments_to_Likes','Duration_ms']
for col in log_features:
    df[f'Log_{col}'] = np.log1p(df[col])
In [63]:
df['Licensed'] = df['Licensed'].astype(str).map({'True': 1, 'False': 0})
df['official_video'] = df['official_video'].astype(str).map({'True': 1, 'False': 0})
df['Album_type_Label'] = df['Album_type'].map({'single': 1, 'album': 0})
df['Artist_freq'] = df['Artist'].map(df['Artist'].value_counts())
df['Channel_freq'] = df['Channel'].map(df['Channel'].value_counts())
In [64]:
# Drop remaining rows with NaNs
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
# Drop irrelevant columns after features are added
df.drop(['Description', 'Title', 'Url_youtube', 'Uri','Url_spotify','Track','Album_type','Unnamed: 0', 'Channel','Album','Artist'], axis=1, errors='ignore', inplace=True)
In [65]:
print('Data Cleaning Completed\n')
df.info()
Data Cleaning Completed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19298 entries, 0 to 19297
Data columns (total 45 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Danceability               19298 non-null  float64
 1   Energy                     19298 non-null  float64
 2   Key                        19298 non-null  float64
 3   Loudness                   19298 non-null  float64
 4   Speechiness                19298 non-null  float64
 5   Acousticness               19298 non-null  float64
 6   Instrumentalness           19298 non-null  float64
 7   Liveness                   19298 non-null  float64
 8   Valence                    19298 non-null  float64
 9   Tempo                      19298 non-null  float64
 10  Duration_ms                19298 non-null  float64
 11  Views                      19298 non-null  float64
 12  Likes                      19298 non-null  float64
 13  Comments                   19298 non-null  float64
 14  Licensed                   19298 non-null  int64  
 15  official_video             19298 non-null  int64  
 16  Stream                     19298 non-null  float64
 17  Album_Song_Count           19298 non-null  int64  
 18  Avg_Artist_Song_Views      19298 non-null  float64
 19  Song_Name_Length           19298 non-null  int64  
 20  Total_Album_Length         19298 non-null  float64
 21  Fitness_for_Clubs          19298 non-null  float64
 22  Likes_to_Views             19298 non-null  float64
 23  Stream_to_Views            19298 non-null  float64
 24  Comments_to_Likes          19298 non-null  float64
 25  Loudness_High              19298 non-null  bool   
 26  Danceability_Valence       19298 non-null  float64
 27  Popular_Site               19298 non-null  int32  
 28  Is_Remix                   19298 non-null  bool   
 29  Streams_per_Minute         19298 non-null  float64
 30  Log_Views                  19298 non-null  float64
 31  Log_Likes                  19298 non-null  float64
 32  Log_Comments               19298 non-null  float64
 33  Log_Stream                 19298 non-null  float64
 34  Log_Album_Song_Count       19298 non-null  float64
 35  Log_Avg_Artist_Song_Views  19298 non-null  float64
 36  Log_Total_Album_Length     19298 non-null  float64
 37  Log_Streams_per_Minute     19298 non-null  float64
 38  Log_Stream_to_Views        19298 non-null  float64
 39  Log_Likes_to_Views         19298 non-null  float64
 40  Log_Comments_to_Likes      19298 non-null  float64
 41  Log_Duration_ms            19298 non-null  float64
 42  Album_type_Label           19298 non-null  int64  
 43  Artist_freq                19298 non-null  int64  
 44  Channel_freq               19298 non-null  int64  
dtypes: bool(2), float64(35), int32(1), int64(7)
memory usage: 6.3 MB
In [66]:
df.describe().T.sort_values('std', ascending=False)
Out[66]:
count mean std min 25% 50% 75% max
Views 19298.0 9.683675e+07 2.791808e+08 26.000000 2.066310e+06 1.558484e+07 7.340811e+07 8.079649e+09
Stream 19298.0 1.381404e+08 2.474362e+08 6574.000000 1.784301e+07 5.026902e+07 1.407806e+08 3.386520e+09
Avg_Artist_Song_Views 19298.0 9.683675e+07 1.594558e+08 3802.800000 1.518029e+07 4.292052e+07 1.075103e+08 1.546021e+09
Streams_per_Minute 19298.0 3.961227e+07 7.332718e+07 1720.582077 4.920396e+06 1.404450e+07 3.981542e+07 1.015753e+09
Likes 19298.0 6.799624e+05 1.815996e+06 0.000000 2.395475e+04 1.317370e+05 5.394230e+05 5.078865e+07
Total_Album_Length 19298.0 6.586782e+05 1.117053e+06 30985.000000 2.318890e+05 4.396870e+05 7.892400e+05 4.123335e+07
Stream_to_Views 19298.0 2.601677e+03 2.786099e+05 0.000074 1.113401e+00 3.066113e+00 1.218862e+01 3.863756e+07
Comments 19298.0 2.822475e+04 1.971631e+05 0.000000 5.580000e+02 3.456500e+03 1.478250e+04 1.608314e+07
Duration_ms 19298.0 2.247218e+05 1.275723e+05 30985.000000 1.802432e+05 2.133575e+05 2.519268e+05 4.676058e+06
Tempo 19298.0 1.205809e+02 2.957300e+01 0.000000 9.699750e+01 1.199650e+02 1.399405e+02 2.433720e+02
Channel_freq 19298.0 1.153695e+01 2.793349e+01 1.000000 2.000000e+00 7.000000e+00 1.000000e+01 2.380000e+02
Loudness 19298.0 -7.622436e+00 4.618275e+00 -46.251000 -8.756000e+00 -6.506000e+00 -4.922000e+00 9.200000e-01
Key 19298.0 5.292103e+00 3.579583e+00 0.000000 2.000000e+00 5.000000e+00 8.000000e+00 1.100000e+01
Album_Song_Count 19298.0 2.894808e+00 3.011082e+00 1.000000 1.000000e+00 2.000000e+00 3.000000e+00 2.800000e+01
Log_Views 19298.0 1.614225e+01 2.723626e+00 3.295837 1.454128e+01 1.656181e+01 1.811155e+01 2.281261e+01
Log_Comments 19298.0 7.757986e+00 2.722050e+00 0.000000 6.326149e+00 8.148301e+00 9.601267e+00 1.659328e+01
Song_Name_Length 19298.0 3.666805e+00 2.681341e+00 1.000000 2.000000e+00 3.000000e+00 5.000000e+00 4.100000e+01
Log_Likes 19298.0 1.143761e+01 2.555692e+00 0.000000 1.008396e+01 1.178857e+01 1.319826e+01 1.774318e+01
Log_Stream_to_Views 19298.0 1.972495e+00 1.796072e+00 0.000074 7.482986e-01 1.402688e+00 2.579354e+00 1.746974e+01
Log_Streams_per_Minute 19298.0 1.638836e+01 1.646517e+00 7.450999 1.540890e+01 1.645774e+01 1.749976e+01 2.073890e+01
Log_Stream 19298.0 1.765379e+01 1.646086e+00 8.791030 1.669712e+01 1.773290e+01 1.876271e+01 2.194307e+01
Log_Avg_Artist_Song_Views 19298.0 1.739971e+01 1.622488e+00 8.243756 1.653551e+01 1.757486e+01 1.849310e+01 2.115895e+01
Artist_freq 19298.0 9.611048e+00 8.479381e-01 1.000000 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01
Log_Total_Album_Length 19298.0 1.303195e+01 7.876977e-01 10.341291 1.235402e+01 1.299382e+01 1.357883e+01 1.753476e+01
Log_Album_Song_Count 19298.0 1.194217e+00 5.206706e-01 0.693147 6.931472e-01 1.098612e+00 1.386294e+00 3.367296e+00
Licensed 19298.0 7.128718e-01 4.524336e-01 0.000000 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
Album_type_Label 19298.0 2.411131e-01 4.277698e-01 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
Popular_Site 19298.0 2.256192e-01 4.180003e-01 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
official_video 19298.0 7.921028e-01 4.058134e-01 0.000000 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
Log_Duration_ms 19298.0 1.226753e+01 3.112167e-01 10.341291 1.210207e+01 1.227073e+01 1.243690e+01 1.535797e+01
Acousticness 19298.0 2.882187e-01 2.859003e-01 0.000001 4.360000e-02 1.880000e-01 4.690000e-01 9.960000e-01
Valence 19298.0 5.283075e-01 2.452507e-01 0.000000 3.380000e-01 5.350000e-01 7.247500e-01 9.930000e-01
Energy 19298.0 6.358078e-01 2.135669e-01 0.000020 5.100000e-01 6.670000e-01 7.970000e-01 1.000000e+00
Danceability_Valence 19298.0 3.469746e-01 2.006129e-01 0.000000 1.836630e-01 3.303950e-01 4.954723e-01 9.321880e-01
Instrumentalness 19298.0 5.565527e-02 1.930548e-01 0.000000 0.000000e+00 2.410000e-06 4.420000e-04 1.000000e+00
Danceability 19298.0 6.210537e-01 1.655111e-01 0.000000 5.200000e-01 6.390000e-01 7.420000e-01 9.750000e-01
Liveness 19298.0 1.912131e-01 1.651456e-01 0.014500 9.402500e-02 1.250000e-01 2.340000e-01 1.000000e+00
Fitness_for_Clubs 19298.0 6.510185e-01 1.360836e-01 0.066360 5.816645e-01 6.717617e-01 7.473616e-01 9.327136e-01
Speechiness 19298.0 9.471736e-02 1.047307e-01 0.000000 3.570000e-02 5.050000e-02 1.037500e-01 9.640000e-01
Comments_to_Likes 19298.0 3.392488e-02 4.035674e-02 0.000000 1.843275e-02 2.792616e-02 4.106224e-02 2.828808e+00
Log_Comments_to_Likes 19298.0 3.283890e-02 3.018017e-02 0.000000 1.826492e-02 2.754334e-02 4.024157e-02 1.342553e+00
Likes_to_Views 19298.0 1.212797e-02 1.116786e-02 0.000000 5.628699e-03 8.699882e-03 1.489781e-02 2.492042e-01
Log_Likes_to_Views 19298.0 1.199608e-02 1.077273e-02 0.000000 5.612917e-03 8.662256e-03 1.478792e-02 2.225067e-01

Full Data Cleaning and Preprocessing Summary¶

This section outlines the complete data preparation process used to convert the raw Spotify-YouTube dataset into a model-ready format. All decisions were made to ensure feature usability, consistency, and suitability for machine learning models such as SVM, Random Forest, and Gradient Boosting.

1. Imputation and Filtering¶

  • Dropped rows with missing values in key features: Views, Duration_ms, Loudness, Valence, Danceability, Energy, Stream, Title, Track, and Description.
  • Filled missing values in Likes, Comments, and Comments_to_Likes with zero.
  • Converted all 'compilation' values in Album_type to 'album' to enable binary classification (album vs. single).
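The imputation and filtering steps above can be sketched on a toy frame; the column names mirror the dataset but the values are invented, and the real notebook drops on the full key-feature list rather than `Views` alone.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Views':      [100.0, np.nan, 300.0],
    'Likes':      [10.0, 20.0, np.nan],
    'Comments':   [np.nan, 2.0, 3.0],
    'Album_type': ['compilation', 'album', 'single'],
})

# Drop rows missing a key feature (here just Views for brevity).
toy = toy.dropna(subset=['Views'])

# Engagement counts default to zero when absent.
toy[['Likes', 'Comments']] = toy[['Likes', 'Comments']].fillna(0)

# Collapse 'compilation' into 'album' for a binary album-vs-single target.
toy['Album_type'] = toy['Album_type'].replace('compilation', 'album')
```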

2. Feature Engineering¶

We constructed both required and additional features to enrich the dataset:

  • Album_Song_Count: Number of tracks in each album.
  • Avg_Artist_Song_Views: Mean YouTube views per artist.
  • Song_Name_Length: Word count in the song title.
  • Total_Album_Length: Total duration of all songs in the album.
  • Fitness_for_Clubs: Mean of Danceability, Energy, Valence, and normalized Loudness.
  • Likes_to_Views: YouTube engagement ratio.
  • Stream_to_Views: Cross-platform comparison metric.
  • Comments_to_Likes: Indicator of audience expressiveness.
  • Loudness_High: Binary indicator if Loudness > median.
  • Danceability_Valence: Product of Danceability and Valence.
  • Popular_Site: Binary indicator if YouTube views > Spotify streams.
  • Is_Remix: Boolean flag based on the presence of “remix” in title, track, or description.
  • Streams_per_Minute: Streams normalized by song duration.
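A few of the engineered features can be sketched as follows. The toy values are invented, and the min-max normalization of Loudness and the keyword check are assumptions about the notebook's earlier (unshown) cells.

```python
import pandas as pd

toy = pd.DataFrame({
    'Track':        ['Song A', 'Song B (Remix)'],
    'Danceability': [0.8, 0.4],
    'Energy':       [0.7, 0.5],
    'Valence':      [0.6, 0.3],
    'Loudness':     [-5.0, -10.0],
    'Stream':       [1_000_000.0, 250_000.0],
    'Duration_ms':  [200_000.0, 240_000.0],
})

# Is_Remix: case-insensitive keyword flag (the notebook also checks title/description).
toy['Is_Remix'] = toy['Track'].str.contains('remix', case=False)

# Min-max normalize Loudness so it averages sensibly with the 0-1 audio features.
loud_norm = (toy['Loudness'] - toy['Loudness'].min()) / (toy['Loudness'].max() - toy['Loudness'].min())
toy['Fitness_for_Clubs'] = (toy['Danceability'] + toy['Energy'] + toy['Valence'] + loud_norm) / 4

# Streams normalized by song length in minutes.
toy['Streams_per_Minute'] = toy['Stream'] / (toy['Duration_ms'] / 60_000)

toy['Danceability_Valence'] = toy['Danceability'] * toy['Valence']
```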

3. Log Transformation¶

To reduce skew and normalize value ranges, log1p transformation was applied to:

  • Views, Likes, Comments, Stream
  • Album_Song_Count, Avg_Artist_Song_Views, Total_Album_Length
  • Streams_per_Minute, Stream_to_Views, Likes_to_Views, Comments_to_Likes
  • Duration_ms

This ensured features had manageable distributions for distance-based models like SVM.

4. Encoding Categorical Features¶

  • Album_type was mapped to Album_type_Label where:
    • 0 = album, 1 = single
  • Licensed, official_video, Is_Remix, Loudness_High, and Popular_Site were encoded as binary integers.
  • High-cardinality fields:
    • Artist → encoded via frequency count into Artist_freq
    • Channel → encoded similarly as Channel_freq

5. Feature Exclusion¶

Removed features that were non-informative, textual, or already incorporated through feature engineering:

  • 'Unnamed: 0': Index artifact
  • 'Track', 'Title', 'Description': Only used for remix flag
  • 'Url_spotify', 'Url_youtube', 'Uri': Metadata
  • 'Album_type': Replaced by numeric label
  • 'Artist', 'Channel', 'Album': Replaced by engineered/frequency features
  • 'Popular_Site' (string): Replaced by numeric binary flag

6. Final Verification¶

  • All features are numeric: dtypes are float64, int64, int32, and bool
  • No remaining object or string columns
  • Dataset size remains consistent: 19,298 samples, 45 columns (including the Album_type_Label target)
  • Feature scaling is now applicable for modeling

The resulting dataset is fully cleaned, transformed, and ready for stratified train/validation/test splitting and model training.

Part C:¶

We have chosen three models:

  • Random Forest
  • Gradient Boosting (tree-based)
  • SVM (SVC with an RBF kernel)

Section C.1 - Setup and Data Preparation¶

In this section, we prepare the dataset for modeling. We define the features (X) and target (y), clean any infinite or missing values to avoid errors during model training, and split the data into train, validation, and test sets using an 80/10/10 split, as required by the assignment.

In [72]:
# Features and target
X = df.drop(columns=['Album_type_Label'])
y = df['Album_type_Label']

# Remove any remaining infs/NaNs just in case
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X.dropna(inplace=True)
y = y.loc[X.index]

# Split: 80/10/10
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

Section C.2 - Model: Random Forest¶

We train a Random Forest classifier using GridSearchCV to tune n_estimators, max_depth, and apply class_weight='balanced' for handling class imbalance. Evaluation is based on Macro F1 to ensure fairness to both classes.

In [75]:
rf_params = {'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'class_weight': ['balanced']}

rf_gs = GridSearchCV(RandomForestClassifier(random_state=42),rf_params,scoring='f1_macro', cv=3,n_jobs=-1)
rf_gs.fit(X_train, y_train)
print("Best RF Params:", rf_gs.best_params_)
Best RF Params: {'class_weight': 'balanced', 'max_depth': 20, 'n_estimators': 200}
In [76]:
rf_best = rf_gs.best_estimator_
y_val_pred_rf = rf_best.predict(X_val)
y_test_pred_rf = rf_best.predict(X_test)
In [77]:
ConfusionMatrixDisplay.from_estimator(rf_best, X_val, y_val, cmap='Blues')
plt.title("Random Forest - Validation Confusion Matrix")
plt.show()

Random Forest Summary:

  • Tuned using GridSearchCV with 3-fold CV.
  • Best params from the grid: max_depth=20, n_estimators=200.
  • Performed well overall, with high accuracy and solid recall for both classes.
  • Class imbalance was handled using class_weight='balanced' during model initialization, allowing the model to adjust its internal split criteria to give more importance to the minority class.
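For reference, the weights implied by class_weight='balanced' follow n_samples / (n_classes * class_count). A small sketch with an invented label vector at roughly the dataset's ~24% singles ratio:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 76 + [1] * 24)   # 76 albums, 24 singles (invented counts)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)

# Each class c gets weight n_samples / (n_classes * count_c), so the minority
# class's errors cost proportionally more at every split.
```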

Section C.3 - Model: Gradient Boosting¶

Next, we train a Gradient Boosting classifier, tuning tree depth, learning rate, and number of trees. GradientBoostingClassifier has no class_weight parameter, so we address imbalance by refitting the best model with balanced sample weights.

In [81]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import f1_score

# Step 1: Define parameter grid
gboost_params = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5]
}

# Step 2: GridSearchCV using macro F1 to tune for imbalance-aware evaluation
gboost_gs = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    gboost_params,
    scoring='f1_macro',
    cv=3,
    n_jobs=-1
)
gboost_gs.fit(X_train, y_train)

# Step 3: Get best hyperparameters from grid search
best_params = gboost_gs.best_params_

# Step 4: Compute sample weights for class imbalance
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)

# Step 5: Re-train the model with sample weights using best hyperparameters
gboost_best = GradientBoostingClassifier(**best_params, random_state=42)
gboost_best.fit(X_train, y_train, sample_weight=sample_weights)

# Step 6: Make predictions
y_val_pred_gb = gboost_best.predict(X_val)
y_test_pred_gb = gboost_best.predict(X_test)

# Step 7: (Optional) Evaluate F1 scores
val_f1 = f1_score(y_val, y_val_pred_gb, average='macro')
test_f1 = f1_score(y_test, y_test_pred_gb, average='macro')

print("Validation Macro F1 Score:", val_f1)
print("Test Macro F1 Score:", test_f1)
Validation Macro F1 Score: 0.8189149393735442
Test Macro F1 Score: 0.793246115172154
In [82]:
# Confusion Matrix - Gradient Boosting
ConfusionMatrixDisplay.from_estimator(gboost_best, X_val, y_val, cmap='Purples')
plt.title("GBoost - Validation Confusion Matrix")
plt.show()

Gradient Boosting Summary:

  • Tuned tree depth, number of trees, and learning rate.
  • Class imbalance was addressed by retraining the best model with sample_weight computed using compute_sample_weight(class_weight='balanced', y=...), improving minority class recall and macro F1 performance.

Section C.4 - Model: SVM (with scaling)¶

SVM requires feature scaling, so we apply StandardScaler. We use an RBF kernel and tune the hyperparameters C and gamma. We use class_weight='balanced' due to class imbalance.

In [86]:
# Scale features for SVM
scaler = StandardScaler()
X_train_svm = scaler.fit_transform(X_train)
X_val_svm = scaler.transform(X_val)
X_test_svm = scaler.transform(X_test)
In [87]:
svm_params = {'C': [0.1, 1, 10, 50],'gamma': [0.01, 0.1, 1, 'scale', 'auto'],'kernel': ['rbf'],'class_weight': ['balanced']}
svm_gs = GridSearchCV(SVC(), svm_params, scoring='f1_macro', cv=3, n_jobs=-1)
svm_gs.fit(X_train_svm, y_train)
print("Best SVM Params:", svm_gs.best_params_)
Best SVM Params: {'C': 1, 'class_weight': 'balanced', 'gamma': 0.1, 'kernel': 'rbf'}
In [88]:
svm_best = svm_gs.best_estimator_
y_val_pred_svm = svm_best.predict(X_val_svm)
y_test_pred_svm = svm_best.predict(X_test_svm)
In [89]:
# Confusion Matrix - SVM
ConfusionMatrixDisplay.from_estimator(svm_best, X_val_svm, y_val, cmap='Greens')
plt.title("SVM - Validation Confusion Matrix")
plt.show()

SVM Summary:

  • RBF kernel required feature scaling, so we used StandardScaler.
  • Tuned both C (regularization) and gamma (influence radius).
  • We handled class imbalance by setting class_weight='balanced' in the SVM classifier, which adjusts the margin optimization to weigh minority class errors more heavily.
  • Strong macro F1, indicating good sensitivity to the 'single' class, though slower to train.

Section C.5 - VotingClassifier Ensemble¶

We ensemble the three tuned models with a VotingClassifier. The SVM is refit with probability=True, which soft voting requires; although we use hard voting here, this configuration lets us switch to soft voting later. Note that the ensemble's SVM is refit on unscaled features so all estimators share the same input; wrapping the SVM in a Pipeline with StandardScaler would preserve its preferred scaling.

In [92]:
# Refit the SVM on raw (unscaled) features so all ensemble estimators share one input space
# (gamma='scale' is used here because the tuned gamma=0.1 assumed standardized inputs)
svm_for_ensemble = SVC(C=1, gamma='scale', kernel='rbf', class_weight='balanced', probability=True, random_state=42)
svm_for_ensemble.fit(X_train, y_train)

# Define voting classifier
voting = VotingClassifier(
    estimators=[('rf', rf_best),('gb', gboost_best),('svm', svm_for_ensemble)],voting='hard')

# Fit ensemble
voting.fit(X_train, y_train)

# Validation predictions
y_val_pred_vote = voting.predict(X_val)
print("VotingClassifier - Validation Accuracy:", accuracy_score(y_val, y_val_pred_vote))
print(classification_report(y_val, y_val_pred_vote))

# Test predictions
y_test_pred_vote = voting.predict(X_test)
print("VotingClassifier - Test Accuracy:", accuracy_score(y_test, y_test_pred_vote))
print(classification_report(y_test, y_test_pred_vote))
VotingClassifier - Validation Accuracy: 0.8725388601036269
              precision    recall  f1-score   support

           0       0.90      0.93      0.92      1464
           1       0.76      0.68      0.72       466

    accuracy                           0.87      1930
   macro avg       0.83      0.81      0.82      1930
weighted avg       0.87      0.87      0.87      1930

VotingClassifier - Test Accuracy: 0.8601036269430051
              precision    recall  f1-score   support

           0       0.90      0.92      0.91      1465
           1       0.73      0.67      0.70       465

    accuracy                           0.86      1930
   macro avg       0.81      0.79      0.80      1930
weighted avg       0.86      0.86      0.86      1930

Voting Ensemble Summary:

  • Combined all three tuned models.
  • Matched the best validation accuracy and tied for best macro F1.
  • Balanced majority/minority class performance.
  • probability=True in SVM supports potential future soft voting.
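Since the SVM above was refit with probability=True, switching to soft voting only requires voting='soft'. A minimal self-contained sketch on synthetic data; the estimator settings here are placeholders, not the tuned parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC

# Synthetic two-class data at roughly our imbalance ratio.
X_toy, y_toy = make_classification(n_samples=300, n_classes=2, weights=[0.76, 0.24], random_state=42)

soft_vote = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
        ('svm', SVC(probability=True, class_weight='balanced', random_state=42)),
    ],
    voting='soft',   # averages predict_proba instead of taking a majority vote
)
soft_vote.fit(X_toy, y_toy)
proba = soft_vote.predict_proba(X_toy[:5])   # class probabilities, only available with soft voting
```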

Section C.6 - Model Comparison and Evaluation¶

We compare all three models (RF, GBoost, SVM) using both accuracy and macro F1. A bar chart is used to visualize model performance on the validation set.

In [96]:
# Generate macro F1
metrics_summary = {
    'Model': ['Random Forest', 'GBoost', 'SVM'],
    'Val Accuracy': [
        accuracy_score(y_val, y_val_pred_rf),
        accuracy_score(y_val, y_val_pred_gb),
        accuracy_score(y_val, y_val_pred_svm)
    ],
    'Val Macro F1': [
        f1_score(y_val, y_val_pred_rf, average='macro'),
        f1_score(y_val, y_val_pred_gb, average='macro'),
        f1_score(y_val, y_val_pred_svm, average='macro')
    ],
    'Test Accuracy': [
        accuracy_score(y_test, y_test_pred_rf),
        accuracy_score(y_test, y_test_pred_gb),
        accuracy_score(y_test, y_test_pred_svm)
    ],
    'Test Macro F1': [
        f1_score(y_test, y_test_pred_rf, average='macro'),
        f1_score(y_test, y_test_pred_gb, average='macro'),
        f1_score(y_test, y_test_pred_svm, average='macro')
    ]
}
# Add VotingClassifier to the summary
metrics_summary['Model'].append('Voting Ensemble')
metrics_summary['Val Accuracy'].append(accuracy_score(y_val, y_val_pred_vote))
metrics_summary['Val Macro F1'].append(f1_score(y_val, y_val_pred_vote, average='macro'))
metrics_summary['Test Accuracy'].append(accuracy_score(y_test, y_test_pred_vote))
metrics_summary['Test Macro F1'].append(f1_score(y_test, y_test_pred_vote, average='macro'))

summary_df = pd.DataFrame(metrics_summary)
In [97]:
# Validation Plot
summary_df = pd.DataFrame(metrics_summary)
summary_df.set_index('Model')[['Val Accuracy', 'Val Macro F1']].plot(kind='bar', figsize=(8, 5), color=['steelblue', 'seagreen'])
plt.title('Model Comparison: Accuracy and Macro F1')
plt.ylabel('Score')
plt.ylim(0.75, 0.90)
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
In [98]:
# Test Plot
summary_df.set_index('Model')[['Test Accuracy', 'Test Macro F1']].plot(kind='bar', figsize=(8, 5), color=['orange', 'tomato'])
plt.title('Model Comparison: Test Accuracy and Macro F1')
plt.ylabel('Score')
plt.ylim(0.75, 0.90)
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Final Evaluation Summary¶

We evaluated all models using:

  • Validation Accuracy: Measures overall prediction correctness on the validation set.
  • Macro F1 Score: Averages F1 across classes, giving equal weight to both the majority (album) and minority (single) classes — essential due to class imbalance (~24% singles).
Model               Validation Accuracy   Val Macro F1   Test Accuracy   Test Macro F1
Random Forest       87.3%                 0.82           86.7%           0.81
Gradient Boosting   86.5%                 0.82           83.2%           0.79
SVM                 86.0%                 0.82           83.8%           0.79
Voting Ensemble     87.3%                 0.82           86.0%           0.80

Key Insights:¶

  • Random Forest had the highest standalone accuracy and strong macro F1, making it both reliable and interpretable.
  • SVM matched RF in macro F1 and maintained solid performance on the test set, indicating strong handling of the minority class.
  • Gradient Boosting was slightly weaker in both validation and test metrics, particularly in minority class recall.
  • Voting Ensemble combined the strengths of all three models, matching the best validation accuracy and macro F1 while keeping majority/minority performance balanced. It offers the most robust overall profile and is our recommended model.

Section C.7 - Feature Importance (Random Forest)¶

To better understand what drives the model’s classification decisions, we used the built-in feature importance scores of the Random Forest classifier. This helps identify which features most strongly influence whether a song is classified as a single or part of an album.

In [101]:
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_best.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 10))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.gca().invert_yaxis()
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()

Observation: Top contributors included Total_Album_Length, Likes_to_Views, and Album_Song_Count, reflecting that singles tend to be shorter and more engagement-dense. Features like Licensed, Popular_Site, and Is_Remix showed low importance and were flagged for potential removal.
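Impurity-based importances can overstate high-cardinality or correlated features, so permutation importance is a useful complementary check (not run in the notebook). A sketch on synthetic data rather than the fitted rf_best:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the notebook's feature matrix.
X_toy, y_toy = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)

# Shuffle each feature in turn and measure the drop in macro F1.
result = permutation_importance(rf, X_toy, y_toy, n_repeats=5, random_state=42, scoring='f1_macro')
ranking = result.importances_mean.argsort()[::-1]   # most important feature first
```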

In [103]:
correlations = df.corr(numeric_only=True)['Album_type_Label'].sort_values(ascending=False)
print(correlations)
Album_type_Label             1.000000
Log_Likes_to_Views           0.214699
Likes_to_Views               0.211173
Is_Remix                     0.185334
Danceability                 0.157187
Loudness                     0.135283
Loudness_High                0.112715
Fitness_for_Clubs            0.107264
Channel_freq                 0.093319
Popular_Site                 0.086422
Energy                       0.085514
official_video               0.080704
Log_Avg_Artist_Song_Views    0.059551
Danceability_Valence         0.058708
Speechiness                  0.052014
Artist_freq                  0.042403
Log_Likes                    0.034574
Key                          0.030555
Song_Name_Length             0.029157
Avg_Artist_Song_Views        0.020909
Likes                        0.013131
Log_Comments                 0.011887
Tempo                        0.009077
Valence                      0.003521
Comments                    -0.002607
Stream_to_Views             -0.004636
Licensed                    -0.006692
Comments_to_Likes           -0.008392
Liveness                    -0.013966
Log_Comments_to_Likes       -0.020703
Views                       -0.024286
Log_Views                   -0.037140
Instrumentalness            -0.037456
Acousticness                -0.052742
Log_Stream_to_Views         -0.058899
Streams_per_Minute          -0.062940
Duration_ms                 -0.070484
Stream                      -0.080229
Log_Duration_ms             -0.119482
Total_Album_Length          -0.133735
Log_Streams_per_Minute      -0.135013
Log_Stream                  -0.157638
Album_Song_Count            -0.203960
Log_Album_Song_Count        -0.298027
Log_Total_Album_Length      -0.339924
Name: Album_type_Label, dtype: float64

Observation: Positive correlation was strongest for Log_Likes_to_Views and Is_Remix, indicating that remixes and engagement-heavy tracks are more often singles. Strong negative correlation was seen with Album_Song_Count and Total_Album_Length, which is expected since singles contain fewer songs.

In [105]:
#Feature Correlation Matrix
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm', center=0)
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.show()

Observation: We observed strong correlation clusters, especially between raw and log-transformed versions (e.g., Streams vs. Log_Streams). In such cases, we retained the more informative or normalized version and dropped the redundant one, especially in models sensitive to multicollinearity.
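One way to surface such redundant raw/log pairs programmatically is to list feature pairs whose absolute Spearman correlation exceeds a threshold (Spearman scores a monotone transform such as log1p as exactly 1). A sketch on an invented three-column frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
views = rng.lognormal(mean=12, sigma=2, size=500)
toy = pd.DataFrame({
    'Views': views,
    'Log_Views': np.log1p(views),            # monotone transform of Views
    'Tempo': rng.normal(120, 30, size=500),  # unrelated feature
})

corr = toy.corr(method='spearman').abs()
# Keep the strict upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [(row, col) for col in upper.columns for row in upper.index
             if upper.loc[row, col] > 0.9]
```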

Section C.8 - Drop Low Importance Features and Re-evaluate¶

Based on feature importance and correlation analysis, we drop six low-impact features: Licensed, Channel, Key, Song_Name_Length, Liveness, and Popular_Site. We then retrain a Random Forest on the reduced dataset; its validation metrics, reported below, support the pruning.

In [108]:
features_to_drop = ['Licensed', 'Channel', 'Key', 'Song_Name_Length', 'Liveness', 'Popular_Site']
X_reduced = X.drop(columns=features_to_drop, errors='ignore')

X_train_r, X_temp_r, y_train_r, y_temp_r = train_test_split(X_reduced, y, test_size=0.2, random_state=42, stratify=y)
X_val_r, X_test_r, y_val_r, y_test_r = train_test_split(X_temp_r, y_temp_r, test_size=0.5, random_state=42, stratify=y_temp_r)

rf_reduced = RandomForestClassifier(random_state=42)
rf_reduced.fit(X_train_r, y_train_r)
y_val_pred_r = rf_reduced.predict(X_val_r)

print("Validation Accuracy (Reduced Features):", accuracy_score(y_val_r, y_val_pred_r))
print(classification_report(y_val_r, y_val_pred_r))
Validation Accuracy (Reduced Features): 0.8797927461139896
              precision    recall  f1-score   support

           0       0.90      0.95      0.92      1464
           1       0.80      0.67      0.73       466

    accuracy                           0.88      1930
   macro avg       0.85      0.81      0.83      1930
weighted avg       0.88      0.88      0.88      1930

Validation Accuracy: 88.0%
Macro F1 Score: 0.83
Class 0 (Album):

  • Precision: 0.90
  • Recall: 0.95
  • F1-score: 0.92

Class 1 (Single):

  • Precision: 0.80
  • Recall: 0.67
  • F1-score: 0.73

Macro Average:

  • Precision: 0.85
  • Recall: 0.81
  • F1-score: 0.83

After dropping six low-importance features, the model maintained strong overall accuracy (88.0%) while improving balance across classes. Recall for the minority class (singles) held at 0.67 with precision rising to 0.80, a meaningful gain in imbalanced classification. This suggests that pruning reduced noise and sharpened decision boundaries without sacrificing generalization.

  • Feature importance revealed strong influence from album-level traits and engagement ratios.
  • Correlation analysis supported these findings and identified statistically weak features.
  • Redundant features (e.g., both raw and log-transformed versions) were simplified using domain logic and heatmap insight.
  • Dropping low-impact features improved classification of the minority class without harming accuracy.
  • Random Forest was preferred for its strong performance and interpretability.
  • Recommended for deployment: Random Forest or VotingClassifier using a refined feature set.

Section D:¶

In [113]:
# Define focused features for clustering
features_to_use = [
    'Danceability', 'Valence', 'Energy', 'Loudness', 'Tempo',
    'Speechiness', 'Acousticness', 'Instrumentalness', 'Liveness',
    'Fitness_for_Clubs', 'Danceability_Valence',
    'Log_Likes_to_Views', 'Log_Stream_to_Views', 'Log_Comments_to_Likes',
    'Log_Views', 'Log_Stream', 'Log_Avg_Artist_Song_Views',
    'Log_Total_Album_Length', 'Log_Duration_ms', 'Log_Streams_per_Minute'
]

X_cluster = df[features_to_use]

# Standardize features
scaler = StandardScaler()
X_cluster_scaled = scaler.fit_transform(X_cluster)

Feature Selection and Standardization¶

We selected 20 features reflecting both musical characteristics (e.g., Energy, Acousticness) and user engagement metrics (e.g., the Likes-to-Views ratio); many are engineered to capture deeper relationships. Because clustering algorithms, K-Means in particular, are sensitive to scale, we standardized all selected features with StandardScaler so each contributes equally to distance calculations regardless of its original units or value range.

In [116]:
# Try K values
silhouette_scores = []
inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    kmeans.fit(X_cluster_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_cluster_scaled, kmeans.labels_))

# Plot results
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

ax[0].plot(K_range, inertias, marker='o')
ax[0].set_title('Elbow Method (Inertia)')
ax[0].set_xlabel('Number of clusters')
ax[0].set_ylabel('Inertia')

ax[1].plot(K_range, silhouette_scores, marker='o', color='green')
ax[1].set_title('Silhouette Score')
ax[1].set_xlabel('Number of clusters')
ax[1].set_ylabel('Silhouette Score')

plt.tight_layout()
plt.show()

Choosing the Optimal Number of Clusters¶

We used two methods to guide the selection of the optimal number of clusters (K):

Elbow Method: Observes inertia (total within-cluster variance); we look for a point where adding more clusters doesn't significantly improve fit.

Silhouette Score: Evaluates how well-separated the clusters are. Higher values indicate better-defined clusters. These visualizations helped us decide on the best K value (in our case, 3).

In [118]:
# Try DBSCAN with a chosen eps value (tune manually)
dbscan = DBSCAN(eps=2.0, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_cluster_scaled)

# Filter noise for silhouette score (label = -1 is noise)
valid = dbscan_labels != -1
if valid.sum() > 1:
    print("DBSCAN Silhouette Score:", silhouette_score(X_cluster_scaled[valid], dbscan_labels[valid]))
else:
    print("DBSCAN produced too few valid clusters.")

# Fit K-Means with the optimal K chosen above (K=3)
kmeans_final = KMeans(n_clusters=3, random_state=42, n_init='auto')
cluster_labels = kmeans_final.fit_predict(X_cluster_scaled)
DBSCAN Silhouette Score: -0.1688632751033337

Comparing DBSCAN and Final K-Means Clustering¶

We briefly experimented with DBSCAN, a density-based clustering algorithm. It is good at detecting arbitrarily shaped clusters but often struggles in high-dimensional, dense data. We then finalized K-Means with K=3, based on our earlier evaluation.

Observation:¶

DBSCAN marked a large portion of the data as noise (label = -1), indicating that the feature space is too dense or lacks clear density-based structure. This is expected in high-dimensional, standardized data where Euclidean distances become less meaningful.
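Rather than tuning eps purely by hand, a common heuristic is the k-distance curve: sort each point's distance to its (min_samples)-th neighbour and look for a "knee". A sketch on hypothetical standardized data (our real matrix is `X_cluster_scaled`):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a standardized feature matrix
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

k = 5  # match DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)
dists, _ = nn.kneighbors(X_demo)   # first column is the point itself (distance 0)
k_dist = np.sort(dists[:, -1])     # sorted distance to the (k-1)-th true neighbour
print(k_dist[0], k_dist[-1])       # in practice, plot this curve and read off the knee
```

An eps near the knee of the sorted curve tends to separate dense regions from noise; far from it, DBSCAN either merges everything or labels most points -1, as we saw above.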

In [120]:
# Reduce dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cluster_scaled)

# Plot PCA
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='Set2', alpha=0.7)
plt.title('PCA Projection of K-Means Clusters')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.grid(True)
plt.tight_layout()
plt.show()

Visualizing Clusters with PCA¶

To visualize our clusters, we reduced our 20-dimensional feature space to 2D using PCA. This allows us to inspect how well-separated the clusters are. The color coding shows each song's assigned cluster. Observation: The PCA plot shows meaningful cluster separation, validating the K-Means result.
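One caveat worth checking: a 2-D PCA projection only shows a fraction of the total variance, so apparent separation (or overlap) can be misleading. A sketch with hypothetical random data standing in for our 20 scaled features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))  # stand-in for 200 songs x 20 scaled features

pca = PCA(n_components=2).fit(X_demo)
# Fraction of the total variance the 2-D scatter actually displays
print(pca.explained_variance_ratio_.sum())
```

If this sum is low, clusters can be better separated in the full space than the plot suggests (or vice versa).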

In [122]:
# Add cluster labels to original data
df['Cluster'] = cluster_labels

# Group by cluster and compare feature means
cluster_summary = df.groupby('Cluster')[features_to_use].mean().round(2)
print(cluster_summary)

# Heatmap of feature means per cluster
plt.figure(figsize=(12, 6))
sns.heatmap(cluster_summary.T, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Means by Cluster")
plt.tight_layout()
plt.show()
         Danceability  Valence  Energy  Loudness   Tempo  Speechiness  \
Cluster                                                                 
0                0.67     0.62    0.70     -6.77  121.97         0.12   
1                0.64     0.55    0.69     -6.24  122.21         0.09   
2                0.43     0.25    0.30    -14.69  111.39         0.05   

         Acousticness  Instrumentalness  Liveness  Fitness_for_Clubs  \
Cluster                                                                
0                0.22              0.03      0.21               0.71   
1                0.22              0.01      0.19               0.68   
2                0.70              0.26      0.16               0.41   

         Danceability_Valence  Log_Likes_to_Views  Log_Stream_to_Views  \
Cluster                                                                  
0                        0.42                0.02                 2.57   
1                        0.36                0.01                 1.24   
2                        0.12                0.01                 3.27   

         Log_Comments_to_Likes  Log_Views  Log_Stream  \
Cluster                                                 
0                         0.04      14.10       16.34   
1                         0.03      17.95       18.59   
2                         0.03      14.23       17.26   

         Log_Avg_Artist_Song_Views  Log_Total_Album_Length  Log_Duration_ms  \
Cluster                                                                       
0                            16.65                   12.93            12.21   
1                            18.20                   13.10            12.32   
2                            16.19                   13.03            12.23   

         Log_Streams_per_Minute  
Cluster                          
0                         15.13  
1                         17.27  
2                         16.04  
In [123]:
# Calculate variation of each feature across clusters
feature_spreads = cluster_summary.T.std(axis=1).sort_values(ascending=False)
print(feature_spreads.head(5))  # Top 5 most varying features
Tempo                     6.178813
Loudness                  4.733036
Log_Views                 2.186237
Log_Stream                1.131209
Log_Streams_per_Minute    1.073980
dtype: float64

Interpreting Cluster Profiles¶

We appended the cluster labels to our dataset and computed the average feature values per cluster. The heatmap visually compares how musical and engagement features differ across clusters. Observation: Clear patterns emerged:

Cluster 0: Energetic, danceable tracks with moderate views and streams

Cluster 1: A similar energetic sound profile, but with by far the highest views and streams

Cluster 2: Quiet, acoustic-heavy tracks (high acousticness and instrumentalness, low energy and loudness)

Key Features Driving Cluster Differences¶

To identify the features that contribute most to the separation between clusters, we calculated the standard deviation of each feature's mean value across clusters. The features with the highest variability were:

  • Tempo
  • Loudness
  • Log_Views
  • Log_Stream
  • Log_Streams_per_Minute

These results indicate that both musical characteristics (like tempo and loudness) and engagement metrics (like view-related features) are influential in shaping the clusters. This supports the notion that songs are grouped not only by how they sound but also by how they perform with audiences.

In [125]:
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x='Danceability',
    y='Energy',
    hue='Cluster',
    palette='Set2',
    alpha=0.7
)
plt.title("Danceability vs. Energy by Cluster")
plt.xlabel("Danceability")
plt.ylabel("Energy")
plt.grid(True)
plt.tight_layout()
plt.show()

Visual Exploration of Clusters¶

We plotted Danceability vs Energy for all songs, colored by cluster.

  • Clusters 0 and 1 occupy the upper-right region (high energy, high danceability) and overlap heavily in these two dimensions — they are distinguished mainly by engagement features, not sound.
  • Cluster 2 sits in the lower-left — mellow, acoustic tracks with low energy and danceability.

This supports the earlier interpretation and confirms that the clustering reflects musically meaningful groupings.

Why We Set DBSCAN Aside¶

DBSCAN is often less effective on high-dimensional, dense feature spaces (like ours) without extensive tuning, and here it labeled much of the data as noise and scored negatively on silhouette. Since K-Means produced more interpretable clusters with a higher silhouette score and visible PCA separation, we focused our analysis on those results.

In [127]:
# Run Agglomerative clustering
agg = AgglomerativeClustering(n_clusters=3)
agg_labels = agg.fit_predict(X_cluster_scaled)

# Evaluate
silhouette_agg = silhouette_score(X_cluster_scaled, agg_labels)
print("Silhouette Score (Agglomerative):", silhouette_agg)
Silhouette Score (Agglomerative): 0.04545357256431154

Agglomerative Clustering¶

We applied hierarchical (agglomerative) clustering with the same number of clusters (3) to compare results with K-Means. Observation: The silhouette score (≈0.05) was noticeably lower than that of K-Means, suggesting more overlap or softer cluster boundaries. Still, it provides useful validation that similar groupings can emerge from a different algorithm.
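Agglomerative clustering builds a merge tree bottom-up; cutting that tree yields the flat clusters. A minimal sketch on synthetic data using SciPy's linkage (Ward criterion, the sklearn default), which also enables dendrogram inspection:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in: three well-separated blobs
X_demo, _ = make_blobs(n_samples=60, centers=3, random_state=42)

# Ward linkage merges the pair of clusters that least increases within-cluster variance;
# cutting the tree into 3 flat clusters mirrors AgglomerativeClustering(n_clusters=3)
Z = linkage(X_demo, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')
print(sorted(set(labels)))  # three flat clusters: [1, 2, 3]
```

Plotting `scipy.cluster.hierarchy.dendrogram(Z)` on a subsample is a useful complement to the silhouette comparison, since it shows at which merge heights the three-cluster structure appears.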

In [129]:
# cluster_labels is from: cluster_labels = kmeans_final.fit_predict(X_cluster_scaled)
silhouette_kmeans = silhouette_score(X_cluster_scaled, cluster_labels)
print("Silhouette Score (K-Means):", silhouette_kmeans)
Silhouette Score (K-Means): 0.12967693891098206

Clustering Evaluation (Silhouette Scores)¶

Algorithm Silhouette Score
K-Means (K=3) 0.13
Agglomerative (K=3) 0.05

K-Means produced the better-defined clusters of the two, while Agglomerative showed more overlap and softer boundaries. Both scores are modest, indicating real but weak cluster structure rather than sharply separated groups.

Understanding Silhouette Scores¶

Silhouette scores range from -1 to 1 and reflect how well-separated and compact the clusters are. A score close to 1 means that data points are well-clustered and far from neighboring clusters. A score near 0 suggests overlapping or poorly defined clusters, and a negative score indicates misclassified points.

Our K-Means score of roughly 0.13 indicates weak-to-moderate structure — the clusters are meaningful but far from sharply separated, which is expected given the diversity of songs in our dataset.
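The interpretation of the score scale can be made concrete with a toy contrast (synthetic data, not our dataset): tight, well-separated blobs score near 1, while heavily overlapping blobs score near 0.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Tight, well-separated blobs -> silhouette near 1
X_tight, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.3, random_state=42)
# Heavily overlapping blobs -> silhouette near 0
X_loose, _ = make_blobs(n_samples=300, centers=3, cluster_std=5.0, random_state=42)

results = []
for X in (X_tight, X_loose):
    labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
    results.append(silhouette_score(X, labels))
print([round(s, 2) for s in results])  # first score high, second score low
```

Our real-data scores sit closer to the second case, consistent with gradual rather than crisp boundaries between song groups.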

Part E – Exploring Artist Engagement¶

In this section, we aim to explore how different artists engage listeners based on their musical features and listener response metrics. Our goal is to build a machine learning model that can classify whether an artist is highly engaging or not.

We define an artist as high engagement if their average number of likes per view (in log scale) is above the median across all artists. This measure serves as a proxy for how effectively an artist turns views into interaction.

To start, we generate some descriptive visualizations to better understand the distribution of artist-level data and the engagement signal.
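The median-split labeling described above can be sketched on a tiny hypothetical artist-level frame (the real `artist_df` is built in the cells below):

```python
import pandas as pd

# Hypothetical artist-level data; only the label logic is illustrated here
demo = pd.DataFrame({
    'Artist': ['A', 'B', 'C', 'D'],
    'Avg_Log_LikesToViews': [0.010, 0.025, 0.018, 0.005],
})

# High engagement = strictly above the median ratio across artists
median = demo['Avg_Log_LikesToViews'].median()
demo['High_Engagement'] = (demo['Avg_Log_LikesToViews'] > median).astype(int)
print(demo['High_Engagement'].tolist())  # -> [0, 1, 1, 0]
```

A median split guarantees a roughly balanced label, which is why the class distribution plotted later comes out nearly 50/50.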

In [132]:
# Load original data (only once!)
df_raw = pd.read_csv("Spotify_Youtube.csv") 

# Extract Artist column only
track_artist = df_raw[['Artist']].copy()
track_artist.index = df_raw.index  

# Copy cleaned df
df_partE = df.copy()

# Attach Artist to cleaned df (only for Part E); align on the shared index
df_partE['Artist'] = track_artist['Artist']

# Drop missing artists (just in case)
df_partE = df_partE.dropna(subset=['Artist'])
In [133]:
# Group everything at once into a fresh artist-level DataFrame
artist_df = df_partE.groupby('Artist').agg(
    Avg_Views=('Views', 'mean'),
    Avg_Streams=('Stream', 'mean'),
    Avg_Likes=('Likes', 'mean'),
    Avg_Comments=('Comments', 'mean'),
    Avg_Fitness=('Fitness_for_Clubs', 'mean'),
    Avg_Danceability=('Danceability', 'mean'),
    Avg_Energy=('Energy', 'mean'),
    Avg_Loudness=('Loudness', 'mean'),
    Avg_Tempo=('Tempo', 'mean'),
    Avg_Log_StreamToViews=('Log_Stream_to_Views', 'mean'),
    Avg_Log_LikesToViews=('Log_Likes_to_Views', 'mean'),
    Avg_Log_CommentsToLikes=('Log_Comments_to_Likes', 'mean'),
    Avg_DanceValence=('Danceability_Valence', 'mean'),
    Loudness_High_Rate=('Loudness_High', 'mean'),
    Total_Songs=('Album_type_Label', 'count'),
    
).reset_index()

# Correct binary label for Part E: High Engagement
engagement_median = artist_df['Avg_Log_LikesToViews'].median()
artist_df['High_Engagement'] = (artist_df['Avg_Log_LikesToViews'] > engagement_median).astype(int)

Artist-Level Feature Descriptions¶

Each row in artist_df represents a single artist, created by aggregating song-level data from df_partE. Below is a description of each aggregated feature used:

  • Avg_Views: The average number of YouTube views across all songs by the artist.
  • Avg_Streams: The average number of Spotify streams across the artist’s songs.
  • Avg_Likes: The average number of likes the artist’s songs receive on YouTube.
  • Avg_Comments: The average number of YouTube comments per song.
  • Avg_Fitness: An aggregated score indicating how well an artist’s songs fit in club settings. It combines danceability, energy, valence, and loudness.
  • Avg_Danceability: Average Spotify danceability score, indicating how suitable the artist’s music is for dancing.
  • Avg_Energy: The average energy level of the artist’s songs — high values indicate loud, fast, and intense music.
  • Avg_Loudness: The average loudness (in dB) across the artist’s songs.
  • Avg_Tempo: The average tempo (BPM) of the artist’s songs.
  • Avg_Log_StreamToViews: The log-transformed average ratio of Spotify streams to YouTube views — a signal of Spotify performance relative to exposure.
  • Avg_Log_LikesToViews: The log-transformed average ratio of YouTube likes to views — a core indicator of how engaged the audience is.
    Used to define the label and excluded from model training.
  • Avg_Log_CommentsToLikes: The log-transformed average ratio of comments to likes — suggests how expressive or vocal fans are beyond simple likes.
  • Avg_DanceValence: A composite feature calculated as Danceability × Valence to represent “feel-good danceability.”
  • Loudness_High_Rate: The percentage of songs by the artist that were above the dataset’s median loudness — identifies artists with consistently loud (and potentially aggressive or mastered-for-radio) tracks.
  • Total_Songs: The total number of songs each artist has in the dataset.
    Dropped from the model due to dataset capping all artists at 10 songs.
  • High_Engagement: Binary label (1 = high engagement, 0 = low), defined by whether Avg_Log_LikesToViews is above the dataset median.
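The log-ratio features above follow a common pattern: divide two engagement counts, then log-transform to tame the heavy tails. A sketch on hypothetical song rows; the exact transform used earlier in this notebook may differ, `log1p` of the ratio is one zero-safe choice assumed here:

```python
import numpy as np
import pandas as pd

# Hypothetical song-level counts (illustration only)
songs = pd.DataFrame({'Likes': [1000.0, 50.0], 'Views': [20000.0, 5000.0]})

# log1p(x) = log(1 + x): well-defined at ratio 0 and compresses large ratios
songs['Log_Likes_to_Views'] = np.log1p(songs['Likes'] / songs['Views'])
print(songs['Log_Likes_to_Views'].round(3).tolist())
```

Averaging such per-song values within each artist (as the `groupby('Artist').agg(...)` cell does) then yields the `Avg_Log_*` columns.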
In [135]:
# Visualizing the number of songs per artist
artist_counts = df_partE['Artist'].value_counts()

plt.figure(figsize=(10, 5))
sns.histplot(artist_counts, bins=30, kde=False)
plt.title("Distribution of Number of Songs per Artist")
plt.xlabel("Number of Songs")
plt.ylabel("Number of Artists")
plt.grid(True)
plt.tight_layout()
plt.show()

Observation:
The vast majority of artists in the dataset have exactly 10 songs, with very little variation. This suggests the data was preprocessed to limit the number of songs per artist (likely using a cap such as head(10) during grouping).

As a result, features related to artist productivity, such as Total_Songs or Album_Song_Count, offer little to no variation and are unlikely to contribute meaningfully to prediction. Therefore, we excluded these features from the final model.
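Screening out such (near-)constant columns can be done mechanically with a distinct-value check. A sketch on a hypothetical frame mimicking the capped `Total_Songs` column:

```python
import pandas as pd

# Hypothetical frame: Total_Songs is constant (capped at 10), so it carries no signal
demo = pd.DataFrame({
    'Total_Songs': [10, 10, 10, 10],
    'Avg_Energy': [0.7, 0.4, 0.9, 0.5],
})

# Keep only columns with more than one distinct value
informative = [c for c in demo.columns if demo[c].nunique() > 1]
print(informative)  # -> ['Avg_Energy']
```

`sklearn.feature_selection.VarianceThreshold` offers the same filter for purely numeric matrices; here we simply excluded `Total_Songs` by name.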

In [137]:
# Distribution of log(likes/views) per artist
plt.figure(figsize=(10, 5))
sns.histplot(artist_df['Avg_Log_LikesToViews'], bins=30, kde=True)
plt.axvline(artist_df['Avg_Log_LikesToViews'].median(), color='red', linestyle='--', label='Median')
plt.title("Distribution of Average Log Likes-to-Views Ratio per Artist")
plt.xlabel("Avg Log(Likes / Views)")
plt.ylabel("Number of Artists")
plt.legend()
plt.tight_layout()
plt.show()

Observation:
The average log likes-to-views ratio varies across artists, with most artists clustered around the median. We use this median value as a threshold to define our binary classification label: High Engagement = 1 if above the median, and 0 otherwise.

In [139]:
# Visualizing high vs low engagement label distribution
sns.countplot(x='High_Engagement', data=artist_df, hue='High_Engagement', palette='Set2')
plt.title("Distribution of High Engagement Labels")
plt.xlabel("High Engagement (1 = High, 0 = Low)")
plt.ylabel("Number of Artists")
plt.xticks([0, 1], ['Low Engagement', 'High Engagement'])
plt.tight_layout()
plt.show()

Observation:
Our engagement label is perfectly balanced, with roughly equal numbers of high and low engagement artists. This allows us to train a fair classification model without major imbalance issues.

Modeling Artist Engagement¶

Goal:¶

The goal of this section is to predict whether an artist is highly engaging, based on their musical attributes and listener response metrics.

We define an artist as high engagement if their average log likes-to-views ratio (Avg_Log_LikesToViews) is above the dataset median. This serves as a proxy for how effectively an artist converts views into likes — a key signal of listener interaction.

Modeling Approach:¶

To predict the High_Engagement label, we:

  1. Aggregated song-level data into artist-level features (e.g., average danceability, energy, fitness for clubs, stream-to-view ratios).
  2. Dropped features that were constant or artificial (Total_Songs, which was capped at 10 for all artists).
  3. Removed label-derived and identifier columns:
    ['High_Engagement', 'Avg_Log_LikesToViews', 'Artist', 'Total_Songs']
  4. Trained a Random Forest Classifier with grid search (GridSearchCV) using F1 macro score.
  5. Split the data into 80% training, 10% validation, and 10% testing using stratified sampling to preserve label balance.

All features were numeric and normalized at the artist level.

In [142]:
# Step 1: Define features and label
# All features except label, original source of label, and non-numerics
excluded = ['High_Engagement', 'Avg_Log_LikesToViews', 'Artist','Total_Songs']

all_features = [col for col in artist_df.columns 
                if col not in excluded and artist_df[col].dtype in ['float64', 'int64']]

X = artist_df[all_features]
y = artist_df['High_Engagement']


# Step 2: Split into train/val/test (80/10/10)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.10, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.1111, stratify=y_trainval, random_state=42)

# Step 3: Grid search with balanced RF (StratifiedKFold is needed for the CV splitter)
from sklearn.model_selection import StratifiedKFold
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'class_weight': ['balanced']
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='f1_macro',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    n_jobs=-1,
    verbose=1
)

# Step 4: Fit model and evaluate
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)

# Step 5: Evaluation
print("Best Params:", grid_search.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Params: {'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}

Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.77      0.77        97
           1       0.77      0.77      0.77        97

    accuracy                           0.77       194
   macro avg       0.77      0.77      0.77       194
weighted avg       0.77      0.77      0.77       194

In [143]:
# Predict on training data for heatmap comparison
from sklearn.metrics import confusion_matrix  # not among the top-level imports
y_train_pred = best_model.predict(X_train)

# Compute confusion matrices
cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_train_pred)

# Plot heatmaps
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Test set
sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Test Set Confusion Matrix')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_xticklabels(['Low Engagement', 'High Engagement'])
axes[0].set_yticklabels(['Low Engagement', 'High Engagement'])

# Train set
sns.heatmap(cm_train, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title('Train Set Confusion Matrix')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_xticklabels(['Low Engagement', 'High Engagement'])
axes[1].set_yticklabels(['Low Engagement', 'High Engagement'])

plt.tight_layout()
plt.show()

Model Evaluation – Predicting High Engagement Artists¶

After training a Random Forest classifier to predict whether an artist is considered "high engagement" (based on log likes-to-views ratio), we evaluated the model on both the train and test sets.

Performance Summary:¶

  • Test Accuracy: ~77%
  • Train Accuracy: ~88%
  • F1 Score (Test): 0.77 for both classes
  • Best Parameters:
    • n_estimators = 200
    • max_depth = None
    • min_samples_split = 5
    • min_samples_leaf = 1
    • class_weight = 'balanced'

Confusion Matrix Observations:¶

Test Set¶

With 97 artists per class and precision and recall both at 0.77 for each class, roughly three in four artists of each class are classified correctly, and the errors are split almost evenly between false positives and false negatives.

The model maintained balanced performance and avoided strong bias toward either class.

Train Set¶

  • Training accuracy (~88%) is noticeably higher than test accuracy (~77%), indicating mild overfitting rather than a severe generalization gap.
  • Class separation remains strong in the training data without collapsing at test time.

Insight:¶

The model successfully learned patterns that differentiate high-engagement artists from low-engagement ones using audio and popularity-based features — even though raw song counts and genre labels were excluded.

This classification pipeline demonstrates that artist-level interaction metrics (likes/views), when combined with musical features like danceability, loudness, and stream-to-view ratios, can reliably predict artist engagement trends.

In [145]:
# Step 6: Feature Importance
importances = best_model.feature_importances_
feat_imp = pd.Series(importances, index=X.columns).sort_values(ascending=False)


plt.figure(figsize=(10,6))
sns.barplot(x=feat_imp, y=feat_imp.index)
plt.title("Feature Importance (Random Forest - High Engagement)")
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()

Feature Importance¶

Top features influencing model predictions:

  • Avg_Log_StreamToViews
  • Avg_Views
  • Avg_Danceability
  • Avg_Likes
  • Avg_DanceValence
  • Avg_Log_CommentsToLikes

These features highlight that listener interaction (likes, streams, views) and musical tone (danceability, valence) play a key role in artist engagement.


Conclusion The model performs well with balanced precision and recall, suggesting that musical and engagement signals can meaningfully predict how engaging an artist is. The interpretability of feature importance also supports this insight, making this model both statistically strong and human-understandable.

Gradient Boosting Model: Predicting High Engagement Artists¶

After building a Random Forest classifier, we aimed to further improve model performance using a more optimized approach.

Random Forest provided balanced predictions, but we hypothesized that a more nuanced model like Gradient Boosting could capture subtler patterns in the data, especially given the mix of popularity metrics and musical features.


What We Are Predicting¶

The goal remains the same:
To predict whether an artist is high engagement based on:

  • Audio features (e.g., danceability, energy, valence)
  • Listener behavior metrics (e.g., streams-to-views ratio, likes, comments)

An artist is labeled high engagement (High_Engagement = 1) if their average log likes-to-views ratio is above the dataset median — a robust proxy for fan interaction strength.

By applying GBoost, we aim to increase predictive accuracy while maintaining balance and interpretability across both engagement classes.

In [149]:
gboost = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1
)
gboost.fit(X_train, y_train)
y_pred = gboost.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.79      0.76      0.77        97
           1       0.77      0.79      0.78        97

    accuracy                           0.78       194
   macro avg       0.78      0.78      0.78       194
weighted avg       0.78      0.78      0.78       194

In [150]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

gboost = GradientBoostingClassifier(random_state=42)

grid = GridSearchCV(
    estimator=gboost,
    param_grid=param_grid,
    scoring='f1_macro',
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

# Evaluate best model
best_gboost = grid.best_estimator_
y_pred = best_gboost.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print("Best Params:", grid.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best Params: {'learning_rate': 0.05, 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.79      0.80        97
           1       0.80      0.80      0.80        97

    accuracy                           0.80       194
   macro avg       0.80      0.80      0.80       194
weighted avg       0.80      0.80      0.80       194

In [151]:
# Predictions
y_test_pred = best_gboost.predict(X_test)
y_val_pred = best_gboost.predict(X_val)

# Confusion matrices
cm_test = confusion_matrix(y_test, y_test_pred)
cm_val = confusion_matrix(y_val, y_val_pred)

# Plot both
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Test set
sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title("Test Set Confusion Matrix")
axes[0].set_xlabel("Predicted")
axes[0].set_ylabel("Actual")
axes[0].set_xticklabels(['Low Engagement', 'High Engagement'])
axes[0].set_yticklabels(['Low Engagement', 'High Engagement'])

# Validation set
sns.heatmap(cm_val, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title("Validation Set Confusion Matrix")
axes[1].set_xlabel("Predicted")
axes[1].set_ylabel("Actual")
axes[1].set_xticklabels(['Low Engagement', 'High Engagement'])
axes[1].set_yticklabels(['Low Engagement', 'High Engagement'])

plt.tight_layout()
plt.show()

Final GBoost Model Tuned and Optimized¶

To maximize performance, we applied a full grid search on Gradient Boosting parameters. The best model used:

  • learning_rate = 0.05
  • n_estimators = 200
  • max_depth = 5
  • min_samples_split = 5
  • min_samples_leaf = 2

Final Evaluation (Test Set)¶

Metric Value
Accuracy 80%
Macro F1 Score 0.80
Precision 0.80 (class 0), 0.80 (class 1)
Recall 0.79 (class 0), 0.80 (class 1)

The model is well-balanced, with strong predictive ability on both high and low engagement artists, and it outperforms our earlier Random Forest across every reported metric.

This version represents the recommended model for Part E.

In [ ]: